feat: Add GenAI/LLM telemetry support for ClickHouse #259
Conversation
- Add `--materialize-genai-fields` flag to DDL tool for creating materialized columns (GenAIOperationName, GenAIProviderName, token counts, etc.)
- Add `--partition-by-service-name` flag for multi-tenant optimization
- Add `genai_redaction_processor.py` for PII redaction in LLM messages
- Update documentation with usage examples
rjenkins left a comment
Thanks @brightsparc, did a quick skim and this looks pretty good! I'm going to dig into the Python code tomorrow and should be able to finish review.
Question on the DDL: do you know if there have been any changes on the OTel Collector side in regards to the schema for the CH exporter with the GenAI changes? I checked here https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/sqltemplates, but it doesn't look like anyone has done any work on this there yet either. Either way, excited to get this in.
Can you merge from main to include this PR: #256? That should allow us to schedule the status checks.
I'm not aware of any specific new fields at the table level; most of these changes are under the span attributes AFAIK.
rjenkins left a comment
@brightsparc, I left a comment on the processor regarding overlap with the existing redaction processor and would like to hear your thoughts on that. Additionally, I'm curious how you are using/querying the data with these changes today. Are you using ClickStack, or are you directly querying ClickHouse in some other manner?
The other big change here is the schema changes to CH. Nice job on making them "opt-in" via args on clickhouse-ddl. I'm reaching out to our friends at ClickHouse to get their thoughts on these, so give me a day or two to get back to you.
At a high level, there are essentially two things we should consider.
- What is the performance impact on insert for the materialized columns and partition/order by changes. As mentioned it's opt-in, so not necessarily a blocker but there is likely additional load on the CH server and theoretically a drop in insert performance. Ideally we'd quantify that and make a note in the README.md about the perf impact. On the flip side the query performance will be better for aligned queries.
We can likely help with some of the perf analysis if you can share some code to generate synthetic spans with the related genai fields or perhaps even more useful we could modify our load generator https://github.com/streamfold/otel-loadgen to include these new fields and we can run some benchmarks.
- Does this have an impact on ClickStack? I suspect ClickStack will ignore these additional columns, so might not be an issue but we should test and verify we don't break anything with their UI.
If the ClickHouse folks find these additional columns very interesting, we may want to consider more closely how we generate them (materialized, or transformed natively in Rotel's CH exporter) and whether we should add them to the default OTel traces table or whether they're better suited to a separate GenAI traces table.
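To help with benchmarking, a synthetic-span generator like the one mentioned could be sketched roughly as follows. This is a minimal sketch, not part of the PR: attribute keys follow the OTel GenAI semantic conventions, while the message payload shape and the helper name are assumptions for illustration.

```python
import json
import random

def synthetic_genai_span(turns: int) -> dict:
    """Build a span-attribute dict simulating a multi-turn, tool-calling chat.

    Keys follow the gen_ai.* semantic conventions; values are synthetic.
    """
    messages = []
    for i in range(turns):
        messages.append({
            "role": "user",
            "parts": [{"type": "text", "content": f"question {i}"}],
        })
        # Roughly half the turns include an assistant tool call.
        if random.random() < 0.5:
            messages.append({
                "role": "assistant",
                "parts": [{"type": "tool_call", "name": "search",
                           "arguments": {"q": f"query {i}"}}],
            })
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "openai",
        "gen_ai.request.model": "gpt-4o",
        "gen_ai.usage.input_tokens": random.randint(50, 2000),
        "gen_ai.usage.output_tokens": random.randint(10, 500),
        "gen_ai.input.messages": json.dumps(messages),
        "gen_ai.output.messages": json.dumps([{
            "role": "assistant",
            "parts": [{"type": "text", "content": "answer"}],
        }]),
    }
```

Something along these lines could slot into otel-loadgen (ported to its language) to exercise the materialized columns under load.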
@@ -0,0 +1,463 @@
"""
@brightsparc - Thanks for this processor. It seems like the bulk of the improvement here is some DX around handling redaction for items with the tool_call prefix within "gen_ai.input.messages" and "gen_ai.output.messages"? This seems useful to me, and we definitely want to add more out-of-the-box processors, so we might land this as-is. However, I'm curious if you looked at the generic redaction processor here: https://github.com/streamfold/rotel/blob/main/rotel_python_processor_sdk/processors/redaction_processor.py.
There seems to be a fair amount of overlap. The redaction_processor and its configuration arguments are based loosely on the similar Go processor. Would some of these changes slot into the existing redaction processor as additional configuration options? WDYT about the overlap, and what was missing from the redaction processor that makes this genai_redaction_processor easier to use?
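For context, the nested message shape that makes generic key/value redaction awkward can be sketched like this. This is a simplified illustration, not the actual code from genai_redaction_processor.py in this PR; the message structure and the email-only pattern are assumptions.

```python
import json
import re

# Illustrative PII pattern; a real processor would support configurable rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_genai_messages(raw: str, placeholder: str = "[REDACTED]") -> str:
    """Redact PII inside a serialized gen_ai.input/output.messages list.

    Walks each message's parts, handling both plain text parts and
    tool_call parts, whose arguments can also carry user-supplied PII.
    """
    messages = json.loads(raw)
    for message in messages:
        for part in message.get("parts", []):
            if part.get("type") == "text":
                part["content"] = EMAIL_RE.sub(placeholder, part.get("content", ""))
            elif part.get("type") == "tool_call":
                # Round-trip through JSON so nested argument values are covered.
                args = json.dumps(part.get("arguments", {}))
                part["arguments"] = json.loads(EMAIL_RE.sub(placeholder, args))
    return json.dumps(messages)
```

The point of difference from the generic redaction processor is that the sensitive values live inside a JSON-encoded attribute value, not in top-level attribute keys, so the processor has to parse and re-serialize the payload.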
Yes, I wasn't sure if there was a way to handle the same sort of GenAI-specific redaction with the existing out-of-the-box solution. I implemented this as more of a POC to see if I could include it as part of the broader GenAI materialization of properties. Also, I could see value in having an option to use Presidio for more sophisticated scrubbing.
I'm happy to drop this from the PR if it's not adding value.
Presidio looks neat. We don't need to drop this, and agreed it's probably a good idea to keep it separate for now (like a POC, as you say) so we can iterate on it without worrying about accidental breakage to the generic redaction processor. I'm just wondering if longer term the capabilities will overlap and we can consolidate. With that said, we want to grow the processor library, so overlap is not a major concern at the moment; if we later find users are confused about which processor to use, we can consolidate then.
    "ORDER_BY",
    &get_order_by("(ServiceName, SpanName, toDateTime(Timestamp))", engine),
),
("PARTITION_BY", &get_partition_by(partition_expr, engine)),
Note: we need to review the performance impact on insert.
The goal here was to be able to support multi-tenant querying of OTel records. I was going to do this by service.name, but I'm open to other/better ways of adding partitioning/indexing.
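For reference, the table layout this produces might look roughly like the following. This is a simplified sketch with only a few columns shown; the DDL actually generated by clickhouse-ddl in this PR may differ in detail.

```sql
-- Sketch of the multi-tenant layout (simplified; generated DDL may differ)
CREATE TABLE otel.otel_traces
(
    Timestamp   DateTime64(9),
    ServiceName LowCardinality(String),
    SpanName    LowCardinality(String)
    -- ... remaining trace columns ...
)
ENGINE = MergeTree
PARTITION BY (ServiceName, toDate(Timestamp))
ORDER BY (ServiceName, SpanName, toDateTime(Timestamp));
```

Leading the partition key and sort key with ServiceName keeps each tenant's data clustered, so per-tenant queries prune partitions and scan fewer granules; the trade-off is more parts on insert when many services write concurrently.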
// Uses SpanAttributes['key'] syntax and casts for numeric values
const GENAI_MATERIALIZED_MAP_SQL: &str = r#"
-- GenAI MATERIALIZED columns (extracted from SpanAttributes Map)
GenAIConversationId String MATERIALIZED SpanAttributes['gen_ai.conversation.id'],
Need to review performance impact on insert.
I wasn't aware of this project; I think it could be good to simulate a combination of various multi-turn, tool-calling scenarios, etc. The idea of adding the indexes was to be able to quickly get to the last message in a multi-turn conversation instead of having to query the whole large JSON object.
I initially started this PR by looking to transform the objects with a GenAI-specific exporter, but decided to go this route instead, which puts the heavy lifting back onto ClickHouse and felt like a more optimal approach. It would be interesting to benchmark.
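The last-message access pattern described above could then look something like this. An illustrative query only: the column names are those proposed in this PR's DDL, and the grouping by conversation is an assumption about how the data would be consumed.

```sql
-- Sketch: latest output message per conversation, without parsing the full
-- gen_ai.output.messages JSON (column names from this PR's materialized DDL)
SELECT
    GenAIConversationId,
    argMax(GenAILastOutputMessage, Timestamp) AS last_message
FROM otel.otel_traces
WHERE GenAIOperationName = 'chat'
GROUP BY GenAIConversationId;
```

Because GenAILastOutputMessage is materialized at insert time, this avoids a JSONExtract over the whole message array on every read, which is where the query-side win comes from.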
Thanks for the context. I will follow up with more thoughts on the schema changes in the coming days. RE: the otel-loadgen, if you want to open a PR to simulate a combination of various multi-turn, tool-calling scenarios, that would be 👍
Hey @brightsparc, we're +1 on landing this but would like to get some benchmarks on our side first. We have a harness to run the benches, but would you be willing to open a PR for https://github.com/streamfold/otel-loadgen that generates synthetic data for the materialized columns? We can then run benchmarks on our end to see what the performance looks like, update the README.md with any relevant data, and get this merged. Also, it might be nice to do a blog post about it if you're interested. We should be able to configure ClickStack to view the new materialized columns, and we could discuss some use cases if you're interested in collaborating on a post.
It might be worth using a public open dataset that has a number of multi-turn inputs/outputs. A good example is the Salesforce APIGen dataset, which has 5k examples for multi-turn. They also have an older dataset of 60k examples that are typically single-turn, although with multiple tool calls. You could use this as a source to create a decent column of data that has input and output messages.
This PR adds support for GenAI/LLM observability workloads by extending the existing ClickHouse traces exporter with materialized fields optimized for LLM telemetry queries.
Changes
ClickHouse DDL Tool (`clickhouse-ddl`)
Added two new flags for traces table creation:
- `--materialize-genai-fields`: Creates MATERIALIZED columns that extract GenAI semantic convention attributes from `SpanAttributes`, enabling efficient querying without parsing JSON on every query.
- `--partition-by-service-name`: Optimizes multi-tenant deployments by partitioning data by `(ServiceName, toDate(Timestamp))` and ordering by `(ServiceName, Timestamp)`.
Materialized GenAI Fields
When `--materialize-genai-fields` is enabled, the following columns are automatically extracted:
- `GenAIOperationName`
- `GenAIProviderName`
- `GenAIRequestModel`
- `GenAIResponseModel`
- `GenAIInputTokens`
- `GenAIOutputTokens`
- `GenAISystemFingerprint`
- `GenAILastInputMessage`
- `GenAILastOutputMessage`
Python Processors
Added `genai_redaction_processor.py` for redacting PII in GenAI message attributes:
- `gen_ai.input.messages`
- `gen_ai.output.messages`
Features:
- Matched sensitive values are replaced with `[REDACTED]`
Usage
Creating Tables
```shell
# Create traces table with GenAI materialized fields and multi-tenant partitioning
clickhouse-ddl create \
  --endpoint http://localhost:8123 \
  --database otel \
  --traces \
  --materialize-genai-fields \
  --partition-by-service-name
```
Running with PII Redaction
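A run with the redaction processor attached might look like the following. This is a hypothetical invocation: the exporter and processor flag names are assumptions, so check the Rotel docs for the exact CLI options.

```shell
# Hypothetical invocation; verify flag names against the Rotel documentation
rotel start \
  --exporter clickhouse \
  --otlp-with-trace-processor ./rotel_python_processor_sdk/processors/genai_redaction_processor.py
```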
Querying GenAI Data
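As an illustration, token usage per model can be aggregated directly from the materialized columns. The query below is an example, with table and database names depending on your deployment.

```sql
-- Example: token usage per model via the materialized GenAI columns
SELECT
    GenAIRequestModel,
    sum(GenAIInputTokens)  AS input_tokens,
    sum(GenAIOutputTokens) AS output_tokens,
    count()                AS calls
FROM otel.otel_traces
WHERE GenAIOperationName != ''
GROUP BY GenAIRequestModel
ORDER BY input_tokens DESC;
```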
Files Changed
- `.gitignore` - Added Python cache patterns
- `src/bin/clickhouse-ddl/ddl_traces.rs` - Added GenAI materialized columns generation
- `src/bin/clickhouse-ddl/main.rs` - Added CLI flags
- `src/bin/clickhouse-ddl/README.md` - Updated documentation
- `src/init/args.rs` - Minor updates
- `src/exporters/clickhouse/mod.rs` - Minor updates
- `rotel_python_processor_sdk/processors/genai_redaction_processor.py` - New PII redaction processor
- `rotel_python_processor_sdk/python_tests/genai_redaction_processor_test.py` - Tests for redaction processor